    Factor Analysis for Multiple Testing (FAMT): An R Package for Large-Scale Significance Testing under Dependence

    The R package FAMT (factor analysis for multiple testing) provides a powerful method for large-scale significance testing under dependence. It is especially designed to select differentially expressed genes in microarray data when the correlation structure among gene expressions is strong. Indeed, this method reduces the negative impact of dependence on multiple testing procedures by modeling the common information shared by all the variables with a factor analysis structure. New test statistics for general linear contrasts are deduced, taking advantage of the common factor structure to reduce correlation and consequently the variance of error rates. Thus, the FAMT method improves on most of the usual methods regarding the non-discovery rate and the control of the false discovery rate (FDR). The steps of this procedure, each corresponding to R functions, are illustrated in this paper by two microarray data analyses. We first present how to import the gene expression data, the covariates and the gene annotations. The second step includes the choice of the optimal number of factors and the factor model fitting, and provides a list of selected genes according to a preset FDR control level. Finally, diagnostic plots are provided to help the user interpret the factors using available external information on either genes or arrays.
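
    For illustration, a minimal Python sketch of the idea behind FAMT follows: fit a low-dimensional factor structure to the residual dependence, re-test each gene with the estimated factors as extra covariates, then apply Benjamini-Hochberg selection. This is not the FAMT R package itself; the PCA-based factor estimate and all variable names are illustrative assumptions.

```python
# Sketch of factor-adjusted multiple testing (the idea behind FAMT), not the
# FAMT R package.  The simple PCA-based factor estimate is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, q = 30, 1000, 2            # arrays, genes, number of common factors
X = rng.normal(size=(n, p))      # expression matrix (n arrays x p genes)
y = rng.normal(size=n)           # covariate of interest (e.g. a contrast)

# 1. Ordinary per-gene regression of expression on the covariate
yc = y - y.mean()
beta = X.T @ yc / np.sum(yc ** 2)
resid = X - np.outer(yc, beta)

# 2. Estimate common factors from the residuals (here: plain PCA)
U, s, Vt = np.linalg.svd(resid - resid.mean(0), full_matrices=False)
Z = U[:, :q] * s[:q]             # n x q factor scores

# 3. Re-test each gene with the factors as extra covariates
D = np.column_stack([np.ones(n), y, Z])
pvals = np.empty(p)
for j in range(p):
    coef, res, rank, _ = np.linalg.lstsq(D, X[:, j], rcond=None)
    rss = res[0] if res.size else np.sum((X[:, j] - D @ coef) ** 2)
    sigma2 = rss / (n - D.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(D.T @ D)[1, 1])
    pvals[j] = 2 * stats.t.sf(abs(coef[1] / se), df=n - D.shape[1])

# 4. Benjamini-Hochberg selection at a preset FDR level
alpha = 0.05
order = np.argsort(pvals)
passed = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, p + 1) / p)[0]
selected = order[: passed.max() + 1] if passed.size else np.array([], dtype=int)
```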

    Stability of variable selection for the classification of high-dimensional data

    High-throughput data have motivated the development of statistical methods for variable selection. These data are characterized by their high dimension and their heterogeneity, since the signal is often observed simultaneously with several confounding factors. The usual approaches are thus called into question, as they can lead to erroneous decisions. Efron (2007), Leek and Storey (2007, 2008) and Friguet et al. (2009) show the negative impact of data heterogeneity on the number of false positives in multiple testing. Variable selection is an important step in building a high-dimensional classification model, because it reduces the dimension of the problem to the most predictive variables. We focus here on the classification performance achieved by variable selection via the LASSO procedure (Tibshirani (1996)) and on the reproducibility of the selected variable sets. Simulations show that the set of variables selected by the LASSO is not the set of the best theoretical predictors. Moreover, good classification performance is only reached for a large number of selected variables. Our method relies on describing the dependence between covariates through a small number of latent variables (Friguet et al. (2009)). The proposed strategy consists in applying the procedures to the data conditionally on this dependence structure. This strategy stabilizes the selected variables: good classification performance is reached with smaller sets of variables, and the most predictive variables are detected.
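
    The strategy can be sketched in Python as follows: estimate a few latent components from the covariates, remove their contribution, and run the LASSO on the adjusted data; the selected set can then be compared with the one obtained on the raw data. The PCA step below stands in for the factor model of Friguet et al. (2009), and the simulated data and names are illustrative assumptions.

```python
# Sketch: LASSO selection on raw covariates vs. covariates adjusted for a few
# latent components capturing the common dependence.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, q = 100, 500, 3
Z = rng.normal(size=(n, q))                      # latent confounding factors
X = Z @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)    # signal carried by 5 covariates

def lasso_support(X, y):
    """Indices of the covariates kept by a cross-validated LASSO."""
    fit = LassoCV(cv=5).fit(X, y)
    return set(np.flatnonzero(fit.coef_))

raw_set = lasso_support(X, y)                    # selection ignoring the dependence

# Remove the part of X explained by the first q principal components,
# i.e. work conditionally on an estimate of the latent structure.
pca = PCA(n_components=q).fit(X)
X_adj = X - pca.inverse_transform(pca.transform(X))
adj_set = lasso_support(X_adj, y)

print(len(raw_set), len(adj_set), raw_set & adj_set)   # sizes and overlap of the two sets
```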

    Signal identification in ERP data by decorrelated Higher Criticism Thresholding

    Event-related potentials (ERPs) are intensive recordings of electrical activity along the scalp, time-locked to motor, sensory, or cognitive events. A main objective in ERP studies is to select the (rare) time points at which the (weak) ERP amplitudes (features) are significantly associated with an experimental variable of interest. Higher Criticism Thresholding (HCT), as an optimal signal detection procedure in the "rare-and-weak" paradigm, appears ideally suited for identifying ERP features. However, ERPs exhibit complex temporal dependence patterns that violate the assumptions under which HCT achieves efficient signal identification. This article first highlights the impact of dependence in terms of instability of signal estimation by HCT. A factor model for the covariance is then introduced into HCT to decorrelate the test statistics and restore stability in estimation. The detection boundary under factor-analytic dependence is derived and the phase diagram is extended accordingly. Using simulations and a real data analysis example, the proposed method is shown to estimate the support of the signal more efficiently than standard HCT and other HCT approaches based on a shrinkage estimation of the covariance matrix.
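
    A minimal Python sketch of the Higher Criticism Thresholding step is given below: two-sided p-values, the HC objective, and a feature threshold taken at its maximizer over the smallest p-values. The factor-model decorrelation of the test statistics proposed in the article is not reproduced here, and the simulated z-scores are illustrative assumptions.

```python
# Sketch of Higher Criticism Thresholding on a vector of z-scores.
import numpy as np
from scipy import stats

def hc_threshold(z, alpha0=0.1):
    """Return the HCT feature threshold (on |z|) for the z-scores `z`."""
    p = 2 * stats.norm.sf(np.abs(z))             # two-sided p-values
    n = len(p)
    order = np.argsort(p)
    p_sorted = p[order]
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    k = int(np.argmax(hc[: max(1, int(alpha0 * n))]))   # search the smallest p-values
    return np.abs(z[order[k]])

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(3.0, 1.0, 20),    # a few rare, weak signals
                    rng.normal(0.0, 1.0, 2000)]) # null features
t = hc_threshold(z)
selected = np.flatnonzero(np.abs(z) >= t)        # estimated support of the signal
```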

    Inferring gene networks using a sparse factor model approach, Statistical Learning and Data Science

    The availability of genome-wide expression data to complement the measurement of a phenotypic trait opens new opportunities for identifying the biological processes and genes involved in trait expression. Differential analysis is usually a preliminary step to identify the key biological processes involved in the variability of the trait of interest. However, this variability should be viewed as resulting from a complex combination of the genes' individual contributions. In other words, exploring the interactions between genes in a network structure, whose vertices are genes and whose edges stand for inhibition or activation relationships, gives much more insight into the internal structure of expression profiles. Many solutions for network analysis are currently available, but efficient estimation of the network from high-dimensional data is still an open issue. Extending the idea introduced for differential analysis by Friguet et al. (2009) [1] and Blum et al. (2010) [2], we propose to take advantage of a factor model structure to infer gene networks. This method shows good inferential properties and also allows an efficient testing strategy for the significance of partial correlations, which provides an interesting tool to explore the community structure of the networks. We illustrate the performance of our method by comparing it with competitors in simulation experiments. Moreover, we apply our method to a lipid metabolism study that aims at identifying the gene networks underlying fatness variability in chickens.
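
    As a rough illustration, the following Python sketch derives partial correlations from a low-rank-plus-diagonal (factor-like) covariance estimate and tests each edge with a Fisher z-test. This is a generic stand-in under simplifying assumptions, not the exact estimator or testing strategy of the paper.

```python
# Sketch: partial correlations from a "factor model" covariance estimate
# (low-rank + diagonal), tested edge by edge with a Fisher z-test.
import numpy as np
from scipy import stats

def factor_covariance(X, q):
    """Low-rank (q factors) + diagonal approximation of the sample covariance."""
    S = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(S)                     # ascending eigenvalues
    low_rank = (V[:, -q:] * w[-q:]) @ V[:, -q:].T
    return low_rank + np.diag(np.clip(np.diag(S - low_rank), 1e-6, None))

def partial_correlations(Sigma):
    P = np.linalg.inv(Sigma)                     # precision matrix
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)                      # rho_ij = -P_ij / sqrt(P_ii P_jj)
    np.fill_diagonal(R, 0.0)
    return R

rng = np.random.default_rng(3)
n, p, q = 200, 50, 3
X = rng.normal(size=(n, p))
R = partial_correlations(factor_covariance(X, q))

# Fisher z-test of each partial correlation (edge) against zero
z = 0.5 * np.log((1 + R) / (1 - R)) * np.sqrt(n - p - 1)
pvals = 2 * stats.norm.sf(np.abs(z))
edges = np.argwhere(np.triu(pvals < 0.01, k=1))  # candidate activation/inhibition edges
```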

    Adaptive decorrelation for high-dimensional prediction

    In large-scale significance analysis, whether or not to ignore dependence is a core issue, and many recent results address the impact of decorrelating the pointwise test statistics. Yet, for the estimation of a prediction model, the decorrelation of large profiles of predicting variables is not as clearly questioned, although many comparative studies have reported the superiority of so-called naive methods, which ignore dependence. Under the usual Gaussian mixture model assumption of Linear Discriminant Analysis, we show that, for a given dependence structure, the classification performance of methods that ignore or account for dependence may be markedly different, according to the pattern of the association signal between the predicting variables and the response. In order to minimize the largest probability of misclassification, we propose a method that handles the dependence adaptively. A simulation study shows that the performance of the proposed method is at least as good as the best of the methods that either ignore dependence or are based on a complete decorrelation of the predicting variables.
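
    A condensed Python sketch of the comparison discussed above follows: a naive diagonal-covariance rule versus a rule that accounts for the covariance of the predictors, with the retained rule chosen by cross-validated accuracy. The adaptive procedure proposed in the paper is more refined than this selection-by-cross-validation shortcut; the simulated data and names are assumptions.

```python
# Sketch: compare a "naive" diagonal-covariance classifier with one that uses
# a (regularized) full covariance of the predictors, and keep the rule with
# the better cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
n, p = 200, 40
y = rng.integers(0, 2, size=n)                               # two classes
common = rng.normal(size=(n, 1)) @ rng.normal(size=(1, p))   # shared dependence
X = common + rng.normal(size=(n, p)) + 0.8 * y[:, None]      # class shift on every predictor

naive = GaussianNB()                                         # ignores dependence
decorrelating = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in [("naive", naive), ("decorrelating", decorrelating)]}
best_rule = max(scores, key=scores.get)                      # rule retained on these data
print(scores, best_rule)
```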

    Variable selection for correlated data in high dimension using decorrelation methods

    The analysis of high-throughput data has renewed the statistical methodology for feature selection. Such data are characterized both by their high dimension and by their heterogeneity, as the true signal and several confounding factors are often observed at the same time. In such a framework, the usual statistical approaches are called into question and can lead to misleading decisions, as they were initially designed under an assumption of independence among variables. In this talk, I will present some improvements of variable selection methods for regression and supervised classification, obtained by accounting for the dependence between selection statistics. The methods proposed in this talk are based on a factor model of the covariates, which assumes that the variables are conditionally independent given a vector of latent variables. During this talk, I will illustrate the impact of dependence on the stability of some usual selection procedures. Next, I will focus in particular on the analysis of event-related potentials (ERP) data, which are widely collected in psychological research to determine the time course of mental events. Such data are characterized by a temporal dependence pattern that is both strong and complex, and which can be modeled by the factor model mentioned above.
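
    For reference, the factor model referred to above can be written as follows (the notation is ours, not taken from the talk): conditionally on a low-dimensional vector of latent factors, the covariates are independent.

```latex
% Factor model for the covariates X = (X_1, ..., X_p):
% conditionally on the latent factors Z, the X_k are independent.
X_k = \mu_k + b_k^{\top} Z + \varepsilon_k, \qquad k = 1, \dots, p,
\qquad Z \in \mathbb{R}^q \text{ with } q \ll p,
\qquad \operatorname{Cov}(X \mid Z) = \operatorname{diag}(\psi_1, \dots, \psi_p).
```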

    A transcriptome multi-tissue analysis identifies biological pathways and genes associated with variations in feed efficiency of growing pigs

    Background - An animal's efficiency in converting feed into lean gain is a critical issue for the profitability of meat industries. This study aimed to describe shared and specific molecular responses in different tissues of pigs divergently selected over eight generations for residual feed intake (RFI). Results - Pigs from the low RFI line had an improved gain-to-feed ratio during the test period and displayed higher leanness but similar adiposity when compared with pigs from the high RFI line at 132 days of age. Transcriptomics data were generated from longissimus muscle, liver and two adipose tissues using a porcine microarray and analyzed for the line effect (n = 24 pigs per line). The most apparent effect of the line was seen in muscle, whereas subcutaneous adipose tissue was the least affected tissue. Molecular data were analyzed by bioinformatics and subjected to multidimensional statistics to identify common biological processes across tissues and key genes participating in the differences in the genetics of feed efficiency. Immune response, response to oxidative stress and protein metabolism were the main biological pathways shared by the four tissues that distinguished pigs from the low and high RFI lines. Many immune genes were under-expressed in the four tissues of the most efficient pigs. The main genes contributing to the difference between pigs from the low vs high RFI lines were CD40, CTSC and NTN1. Different genes associated with energy use were modulated in a tissue-specific manner between the two lines. The gene expression program related to glycogen utilization was specifically up-regulated in the muscle of pigs from the low RFI line (more efficient). Genes involved in fatty acid oxidation were down-regulated in muscle but promoted in the adipose tissues of the same pigs when compared with pigs from the high RFI line (less efficient). This underlines opposite line-associated strategies for energy use in skeletal muscle and adipose tissue. Genes related to cholesterol synthesis and efflux in the liver and perirenal fat were also differentially regulated in pigs from the low vs high RFI lines. Conclusions - Non-productive functions such as immunity, defense against pathogens and the response to oxidative stress likely contribute to inter-individual variations in feed efficiency.

    A factor model to analyze heterogeneity in gene expression

    Background - Microarray technology allows the simultaneous analysis of thousands of genes within a single experiment. Significance analyses of transcriptomic data ignore the gene dependence structure. This leads to correlation among the test statistics, which affects the strong control of the false discovery proportion. A recent method called FAMT allows the gene dependence to be captured by factors in order to improve high-dimensional multiple testing procedures. In the subsequent analyses aiming at a functional characterization of the differentially expressed genes, our study shows how these factors can be used both to identify the components of expression heterogeneity and to give more insight into the underlying biological processes. Results - The use of factors to characterize simple patterns of heterogeneity is first demonstrated on illustrative gene expression data sets. An expression data set primarily generated to map QTL for fatness in chickens is then analyzed. Contrary to the analysis based on the raw data, relevant functional information about a QTL region is revealed by factor-adjustment of the gene expressions. Additionally, the interpretation of the independent factors in the light of known information about both the experimental design and the genes shows that some factors may have different and complex origins. Conclusions - As biological information and technological biases are identified in what was previously considered simply as statistical noise, analyzing heterogeneity in gene expression yields a new point of view on transcriptomic data.
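
    A small Python sketch of the kind of factor interpretation described above: estimate factors from the expression matrix and test their association with a known experimental covariate (here a hypothetical batch indicator) to see which sources of heterogeneity they capture. Plain PCA is used as a stand-in for the FAMT factor estimation, and all names are illustrative assumptions.

```python
# Sketch: relate estimated expression factors to a known experimental covariate
# (e.g. batch) in order to interpret sources of heterogeneity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 40, 800, 3
batch = rng.integers(0, 2, size=n)                 # known design covariate
X = rng.normal(size=(n, p)) + 1.5 * batch[:, None] * rng.normal(size=p)

# Estimate factors by PCA of the centred expression matrix
U, s, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
factors = U[:, :q] * s[:q]

# Test the association of each factor with the known covariate
for k in range(q):
    t, pval = stats.ttest_ind(factors[batch == 0, k], factors[batch == 1, k])
    print(f"factor {k + 1}: association with batch, p = {pval:.3g}")
```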